Abstract

Statistical moments of the intensity distributions are used as molecular descriptors. They
serve as a basis for defining similarity distances between two model spectra. Parameters that
carry the information derived from the comparison of the shapes of the spectra, and that are
related to the number of properties taken into account, are defined.

I. INTRODUCTION

The basic quantities in the statistical theory of spectra are the moments of the intensity
distribution $\mathcal{I}(E)$. In the case of discrete spectra the $n$-th statistical moment is defined as

\[ M_n = \frac{\sum_{i=1}^{\mathrm{max}} \mathcal{I}_i E_i^n}{\sum_{i=1}^{\mathrm{max}} \mathcal{I}_i}, \tag{1} \]

where $\mathcal{I}_i$ is the intensity of the $i$-th line and $E_i$ is the corresponding energy difference. If the
spectral lines are sufficiently close to each other, the spectrum may be approximated by
a continuous function. Then the $n$-th moment of the intensity distribution is defined as

\[ M_n = \frac{\int_{C(E)} \mathcal{I}(E) E^n \, dE}{\int_{C(E)} \mathcal{I}(E) \, dE}, \tag{2} \]

where $C(E)$ is the range of the energy for which the integrand does not vanish. It is
convenient to consider normalized spectra $I(E) = N \mathcal{I}(E)$, where
$N = \left( \int_{C(E)} \mathcal{I}(E) \, dE \right)^{-1}$,
for which the area below the distribution function is equal to 1. Then

\[ M_n = \int_{C(E)} I(E) E^n \, dE. \tag{3} \]

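
As a minimal sketch of the moment definitions above, the following Python snippet evaluates the moments of a discrete line spectrum. The function name `raw_moment` and the three-line spectrum are hypothetical and serve only as an illustration. We assume that the intensities are divided by their sum, in the spirit of the normalized spectra, so that the zeroth moment is equal to 1.

```python
def raw_moment(energies, intensities, n):
    """n-th moment of a discrete spectrum, normalized by the total intensity."""
    total = sum(intensities)
    weighted = sum(i_line * e_line ** n for e_line, i_line in zip(energies, intensities))
    return weighted / total


# Hypothetical three-line spectrum (illustrative values, not taken from the paper).
energies = [1.0, 2.0, 4.0]
intensities = [2.0, 1.0, 1.0]

m0 = raw_moment(energies, intensities, 0)  # area of the normalized spectrum
m1 = raw_moment(energies, intensities, 1)  # mean energy
m2 = raw_moment(energies, intensities, 2)  # second raw moment
print(m0, m1, m2)
```

For these illustrative values the mean energy is the intensity-weighted average of the line positions.
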
Convenient characteristics of the distributions may be derived from the properly scaled
distribution moments. Moments normalized so that the mean value is equal to zero ($M'_1 = 0$) are
referred to as the centered moments. The $n$-th centered moment reads

\[ M'_n = \int_{C(E)} I(E) (E - M_1)^n \, dE. \tag{4} \]

The moments for which, additionally, the variance is equal to 1 ($M''_2 = 1$) are defined as

\[ M''_n = \int_{C(E)} I(E) \left[ \frac{E - M_1}{\sqrt{M'_2}} \right]^n dE. \tag{5} \]

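
The centered and scaled moments can be sketched numerically for a spectrum sampled on an energy grid. The snippet below is only an illustration: the helper names `trapezoid` and `descriptors`, the grid and the single-Gaussian test spectrum are our assumptions, and the trapezoidal rule stands in for the integrals. The scaled moments are obtained by dividing the $n$-th centered moment by the variance raised to the power $n/2$.

```python
import math


def trapezoid(ys, xs):
    """Trapezoidal-rule integral of samples ys over the grid xs."""
    return sum(0.5 * (ys[k] + ys[k + 1]) * (xs[k + 1] - xs[k]) for k in range(len(xs) - 1))


def descriptors(xs, ys):
    """Return (mean, variance, skewness, kurtosis) of a sampled spectrum."""
    area = trapezoid(ys, xs)
    dist = [y / area for y in ys]  # normalized spectrum, unit area
    mean = trapezoid([d * x for d, x in zip(dist, xs)], xs)

    def centered(n):
        return trapezoid([d * (x - mean) ** n for d, x in zip(dist, xs)], xs)

    variance = centered(2)

    def scaled(n):
        # n-th centered moment divided by variance**(n/2)
        return centered(n) / variance ** (n / 2)

    return mean, variance, scaled(3), scaled(4)


# Hypothetical single-band spectrum exp[-c (E - eps)^2] with c = 5.0, eps = 1.2.
xs = [-4.0 + 0.005 * k for k in range(2001)]
ys = [math.exp(-5.0 * (x - 1.2) ** 2) for x in xs]
mean, variance, skewness, kurtosis = descriptors(xs, ys)
print(mean, variance, skewness, kurtosis)
```

For a single Gaussian the variance is $1/(2c)$, the skewness vanishes and the kurtosis equals 3, which gives a convenient check of such a routine.
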
In this work the model spectra are approximated by continuous functions taken as linear
combinations of $\mathrm{max}$ unnormalized Gaussian distributions centered at $\epsilon_i$, with dispersions
$\sigma_i$ defined by the parameters $c_i = 1/(2\sigma_i^2)$, $i = 1, 2, \ldots, \mathrm{max}$:

\[ I(E) = N \sum_{i=1}^{\mathrm{max}} a_i \exp\left[ -c_i (E - \epsilon_i)^2 \right]. \tag{6} \]

The normalization constant $N$ is determined so that the zeroth moment of the distribution
$I(E)$ is equal to 1.

The $n$-th moment of the distribution is equal to

\[ M_n = N \sum_{i=1}^{\mathrm{max}} \int_{-\infty}^{\infty} a_i \exp\left[ -c_i (E - \epsilon_i)^2 \right] E^n \, dE. \tag{7} \]

After some algebra we get the expressions for the moments as functions of the parameters
describing the heights ($a_i$), the widths ($c_i$) and the locations of the maxima ($\epsilon_i$). In particular,

\[ M_1 = N \sqrt{\pi} \sum_{i=1}^{\mathrm{max}} \frac{a_i \epsilon_i}{\sqrt{c_i}}. \tag{8} \]

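
The closed form of the first moment of a Gaussian mixture can be checked against a direct numerical integration. The sketch below is illustrative only: the function names, the integration limits and the two-band parameters are hypothetical, and a simple midpoint rule replaces the integral over the whole energy axis. The normalization uses the Gaussian integral $\int \exp[-c(E-\epsilon)^2]\,dE = \sqrt{\pi/c}$.

```python
import math


def normalization(a, c):
    """N such that the zeroth moment of the Gaussian mixture equals 1."""
    return 1.0 / sum(a_i * math.sqrt(math.pi / c_i) for a_i, c_i in zip(a, c))


def first_moment_closed_form(a, c, eps):
    """M_1 = N sqrt(pi) sum_i a_i eps_i / sqrt(c_i)."""
    n_const = normalization(a, c)
    return n_const * math.sqrt(math.pi) * sum(
        a_i * e_i / math.sqrt(c_i) for a_i, c_i, e_i in zip(a, c, eps)
    )


def first_moment_quadrature(a, c, eps, lo=-10.0, hi=15.0, steps=20000):
    """Midpoint-rule estimate of the integral of I(E) * E over [lo, hi]."""
    n_const = normalization(a, c)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        e = lo + (k + 0.5) * h
        intensity = n_const * sum(
            a_i * math.exp(-c_i * (e - e_i) ** 2) for a_i, c_i, e_i in zip(a, c, eps)
        )
        total += intensity * e * h
    return total


# Hypothetical two-band parameters (illustrative only).
a, c, eps = [1.0, 1.0], [5.0, 20.0], [1.2, 2.2]
closed = first_moment_closed_form(a, c, eps)
numeric = first_moment_quadrature(a, c, eps)
print(closed, numeric)
```

With these parameters the narrower band carries half the area of the broader one, so the mean lies closer to the first maximum.
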
According to the so-called principle of moments, we expect that if the lower moments
of two distributions are identical, the distributions are approximately identical.
In this paper we apply this principle to the theory of molecular similarity. We assume
that molecules have similar properties if their intensity distributions and, consequently, the
corresponding moments are approximately the same.

We propose that statistical moments of the intensity distributions can be treated as a
new kind of molecular descriptor. The first moment, $M_1$, has a very clear meaning: it
describes the mean value of the distribution. In a similar sense a colour index has been
introduced in astronomy: its value allows us to compare the spectra of different stars (it
carries information about the molecules forming the star). The second centered moment,
$M'_2$, is the variance, which gives the width of the distribution. $M''_3$ is the skewness coefficient,
which describes the asymmetry of the spectrum. The kurtosis coefficient $M''_4$ is connected
to the excess of the distribution.

II. THEORY AND THE MODEL SPECTRA



According to the method of moments, the shapes of two distributions are more similar
the larger the number of identical moments. Similarity of distributions in two- and three-moment
approximations, in the context of the construction of envelopes of electronic bands,
has been analyzed in Refs. [5, ...]. Analogously, we define similarity parameters $S^k_{i_1 i_2 \ldots i_k}$
($k$ is the number of properties taken into account in the process of comparison) as a normalized
information derived from a comparison of two distributions, referred to as $\alpha$ and $\beta$.
Here $n$ is the total number of properties taken into account in the comparison of the two
spectra and $i_k = 1, 2, \ldots, n$ ($k = 1, 2, \ldots, n$) corresponds to a specific property. In particular,
as property number one ($i_k = 1$) we take the first moment, as property number
two ($i_k = 2$) the second centered moment, as number three ($i_k = 3$) the asymmetry
coefficient, and as number four ($i_k = 4$) the kurtosis coefficient. In this paper we take $n = 4$ and
the corresponding similarity distances are defined as follows:

\[ D_1 = 1 - \exp[\ldots] \]

The values of all the descriptors may vary from 0 (identical properties) to 1.
We also define an additional parameter which may be evaluated if both spectra we are
going to compare are available:

\[ \mathcal{D} = \frac{1}{2} \int \left| I'^{\alpha}(E) - I'^{\beta}(E) \right| dE. \tag{20} \]

This parameter is given by the integral of the modulus of the difference between the compared
distributions and is not related to the moments. In the definition of $\mathcal{D}$, $I'$ denotes the
distributions transformed so that their averages are the same. If we compare two distributions
of the same shape, then $\mathcal{D} = 0$. If two distributions do not overlap at all, then $\mathcal{D} = 1$. It
is important to note that the distribution moments are defined as numbers attached to a
given spectrum, and the similarity distances $D_n$ are easily derived from the knowledge of
these numbers. The parameter $\mathcal{D}$, though it gives accurate information about the similarity of
two spectra, is rather cumbersome since it may be derived only if the complete spectra are
given.
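
The shape parameter described above can be sketched numerically as follows. This is a hedged illustration, not the procedure used to produce the figures: the function names, the integration window and the test spectra are hypothetical, and the prefactor 1/2 is our assumption, chosen because two normalized, non-overlapping distributions then give exactly 1. Both spectra are shifted so that their first moments coincide before the absolute difference is integrated with a midpoint rule.

```python
import math


def gaussian_mixture(a, c, eps):
    """Return the normalized mixture I(E) and its first moment M_1."""
    areas = [a_i * math.sqrt(math.pi / c_i) for a_i, c_i in zip(a, c)]
    n_const = 1.0 / sum(areas)
    mean = n_const * sum(w_i * e_i for w_i, e_i in zip(areas, eps))

    def intensity(e):
        return n_const * sum(
            a_i * math.exp(-c_i * (e - e_i) ** 2) for a_i, c_i, e_i in zip(a, c, eps)
        )

    return intensity, mean


def shape_distance(spec_a, spec_b, half_width=20.0, steps=40000):
    """Half the integral of |I'_a - I'_b| after shifting both means to zero."""
    (int_a, mean_a), (int_b, mean_b) = spec_a, spec_b
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h  # energy measured from the common mean
        total += abs(int_a(x + mean_a) - int_b(x + mean_b)) * h
    return 0.5 * total


# Hypothetical test spectra (illustrative only).
two_band = gaussian_mixture([1.0, 1.0], [5.0, 5.0], [1.2, 2.2])
shifted_copy = gaussian_mixture([1.0, 1.0], [5.0, 5.0], [3.2, 4.2])
far_apart = gaussian_mixture([1.0, 1.0], [5.0, 5.0], [-5.0, 5.0])
single_band = gaussian_mixture([1.0], [5.0], [0.0])

d_same = shape_distance(two_band, shifted_copy)
d_disjoint = shape_distance(far_apart, single_band)
print(d_same, d_disjoint)
```

The first pair has the same shape and differs only by a shift, so the distance vanishes; the second pair has equal means but practically no overlap, so the distance approaches 1.
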

If two model molecules (or rather their spectra) are identical, up to the accuracy determined
by the considered properties, then all $S^k_{i_1 i_2 \ldots i_k}$ are equal to 0. The maximum value of
$S^k_{i_1 i_2 \ldots i_k}$ is 1 and corresponds to two spectra with no common features within the considered
set of properties.

The result of a comparison of two different objects depends not only on the number of
properties taken into account but also on their choice ($i_1$ or $i_2$ or $\ldots$ $i_n$). Therefore the
quantities $S^k_{i_1 i_2 \ldots i_k}$ defined in Eqs. (12)-(15) should be averaged over all combinations
of the indices $i_k$. Thus, we define parameters $S_k$ as the appropriate averages of $S^k_{i_1 i_2 \ldots i_k}$:

\[ S_k = \binom{n}{k}^{-1} \sum_{i_1 < i_2 < \ldots < i_k} S^k_{i_1 i_2 \ldots i_k}. \tag{21} \]

In particular, in our case:

\[ S_1 = \frac{1}{4} \sum_{i_1} S^1_{i_1}, \tag{22} \]

\[ S_2 = \frac{1}{6} \sum_{i_1 < i_2} S^2_{i_1 i_2}, \tag{23} \]

\[ S_3 = \frac{1}{4} \sum_{i_1 < i_2 < i_3} S^3_{i_1 i_2 i_3}, \tag{24} \]

\[ S_4 = S^4_{1234}. \tag{25} \]

FIG. 1: Two intensity distributions (solid and dashed lines) and the corresponding similarity
parameters $S_4$ (sequence I).
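
The averaging of the similarity parameters over all combinations of the property indices, described before Fig. 1, can be sketched with `itertools.combinations`. The snippet is only an illustration under stated assumptions: the way the distances of one index set are combined (`partial_similarity`, a root mean square) is our hypothetical choice and is not taken from the definitions of this paper, and the four distance values are invented. Only the averaging step itself follows the text.

```python
import math
from itertools import combinations


def partial_similarity(distances):
    """ASSUMED form of S^k for one index set: root mean square of the D values."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def averaged_similarity(all_distances, k, combine=partial_similarity):
    """Average the partial similarity over all k-element index combinations."""
    subsets = list(combinations(all_distances, k))
    return sum(combine(subset) for subset in subsets) / len(subsets)


# Hypothetical similarity distances D_1..D_4 (illustrative values only).
distances = [0.1, 0.2, 0.0, 0.4]
s = {k: averaged_similarity(distances, k) for k in range(1, 5)}
print(s)
```

A different combination rule can be passed through the `combine` argument without changing the averaging over the 4, 6, 4 and 1 index sets that arise for $n = 4$.
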

III. RESULTS AND DISCUSSION 



In order to illustrate our approach, we took model spectra consisting of two bands, i.e.
having two maxima ($\mathrm{max} = 2$):

\[ I^{\gamma}(E) = N \left\{ a_1 \exp\left[ -c_1 (E - \epsilon_1)^2 \right] + a_2 \exp\left[ -c_2 (E - \epsilon_2)^2 \right] \right\}, \tag{26} \]

where $\gamma = \{c_1, a_1, \epsilon_1, c_2, a_2, \epsilon_2\}$. In order to see the relations between the molecular spectra defined
in Eq. (26) and the similarity indices defined in Eqs. (16)-(20) and (22)-(25) in a simple
and transparent way, we study three sequences of spectra, where in each sequence only one
parameter is modified: $c_2$ in sequence I, $a_2$ in sequence II, $\epsilon_2$ in sequence III.

(a) Sequence I corresponds to the situation in which a symmetric spectrum consisting of two
identical Gaussian distributions shifted relative to each other by $\epsilon_2 - \epsilon_1 = 1$ ($a_1 = a_2 =
1.0$, $\epsilon_1 = 1.2$, $\epsilon_2 = 2.2$, $c_1 = c_2 = 5.0$) transforms to a distribution in which the width
of one of the Gaussians changes due to the change of the parameter $c_2 = 5.0 + \delta c$,
where $\delta c \in (0; 19.8)$. Then, we compare the shapes of the intensity distributions $I^{\alpha}(E)$ and
$I^{\beta}(E)$, where $\alpha = \{5.0, 1.0, 1.2, 5.0, 1.0, 2.2\}$, $\beta = \{5.0, 1.0, 1.2, 5.0 + \delta c, 1.0, 2.2\}$.

In Fig. 1 spectra corresponding to $\delta c = 0$ (solid lines) and $\delta c > 0$ (dashed lines) are
compared. In each case the values of $\delta c$ and $S_4$ are given. A correlation between these two
numbers and the shapes of the spectra is clearly seen. The value of $S_4$ increases
when the two spectra become less similar.

FIG. 2: Two intensity distributions (solid and dashed lines) and the corresponding similarity
parameters $S_4$ (sequence II).

(b) Sequence II corresponds to the same symmetric spectrum as before ($a_1 = a_2 = 1.0$,
$\epsilon_1 = 1.2$, $\epsilon_2 = 2.2$, $c_1 = c_2 = 5.0$) transforming to distributions in which the
height of one of the Gaussians changes due to the change of $a_2 = 1.0 + \delta a$, where
$\delta a \in (0; 9.9)$. Then, we compare the shapes of the intensity distributions $I^{\alpha}(E)$ and $I^{\beta}(E)$,
where $\alpha = \{5.0, 1.0, 1.2, 5.0, 1.0, 2.2\}$, $\beta = \{5.0, 1.0, 1.2, 5.0, 1.0 + \delta a, 2.2\}$.

In Fig. 2 spectra corresponding to $\delta a = 0$ (solid lines) and $\delta a > 0$ (dashed lines) are
compared. In each case the values of $\delta a$ and $S_4$ are given. The conclusions are similar to
those in the case of Fig. 1.

FIG. 3: Two intensity distributions (solid and dashed lines) and the corresponding similarity
parameters $S_4$ (sequence III).

(c) Sequence III corresponds to a similar situation as before, except that the maxima in
$I^{\alpha}$ are shifted by 1.5 rather than by 1 ($a_1 = a_2 = 1.0$, $\epsilon_1 = 1.2$, $\epsilon_2 = 2.7$, $c_1 = c_2 =
5.0$). $I^{\alpha}$ transforms to the distribution $I^{\beta}$ in which the location of the second maximum
changes, $\epsilon_2 = 2.7 - \delta\epsilon$, where $\delta\epsilon \in (0; 0.99)$.
Then, we compare the shapes of the intensity distributions $I^{\alpha}(E)$ and $I^{\beta}(E)$, where $\alpha =
\{5.0, 1.0, 1.2, 5.0, 1.0, 2.7\}$, $\beta = \{5.0, 1.0, 1.2, 5.0, 1.0, 2.7 - \delta\epsilon\}$.

In Fig. 3 spectra corresponding to $\delta\epsilon = 0$ (solid lines) and $\delta\epsilon > 0$ (dashed lines) are
compared. In each case the values of $\delta\epsilon$ and $S_4$ are given. The conclusions are similar to
those in the cases described by Figs. 1 and 2.

The molecular descriptors [statistical moments of $I^{\beta}(E)$] are plotted in Fig. 4 versus $\delta c$
(sequence I), $\delta a$ (sequence II), $\delta\epsilon$ (sequence III). In the case of sequence I, it is clear that the
considered change of the spectrum leads to a decrease of the first moment (the intensity is
shifted towards smaller energies). The dispersion of the whole distribution ($M'_2$) also
decreases. The asymmetry of the spectrum changes from totally symmetric ($M''_3 = 0$) to
asymmetric ($M''_3 \neq 0$). The kurtosis coefficient $M''_4$ changes in a non-monotonic
way. It is interesting that for $M''_3$ and $M''_4$ minima appear for $\delta c \neq 0$. In the case of sequence
II, with an increase of $\delta a$ the first moment is shifted towards higher values and the
dispersion of the whole spectrum decreases. The asymmetry of the spectrum decreases and the
kurtosis parameter increases. In the case of sequence III, shifting the second maximum $\epsilon_2$ towards
smaller energies results in a distribution with one maximum instead of two, and the intensity
is shifted towards smaller energies. In consequence the first moment decreases. The whole
distribution becomes narrower and, consequently, $M'_2$ decreases. For all
$\delta\epsilon$ the distributions are symmetric ($M''_3 = 0$) and the kurtosis parameter increases.
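
Some of the trends described above can be reproduced with a short sketch that evaluates the four descriptors of the two-band model in closed form. The function name `descriptors` and the three sample values chosen in each sequence are hypothetical. The formulas use the standard central moments of a Gaussian with variance $1/(2c)$ displaced by $d$ from the overall mean: $1/(2c) + d^2$, $3d/(2c) + d^3$ and $3/(2c)^2 + 6d^2/(2c) + d^4$ for the second, third and fourth moment, respectively.

```python
import math


def descriptors(gamma):
    """gamma = (c1, a1, e1, c2, a2, e2) -> (M_1, M'_2, M''_3, M''_4)."""
    c1, a1, e1, c2, a2, e2 = gamma
    bands = [(a1, c1, e1), (a2, c2, e2)]
    areas = [a * math.sqrt(math.pi / c) for a, c, _ in bands]
    weights = [w / sum(areas) for w in areas]  # relative area of each band
    mean = sum(p * e for p, (_, _, e) in zip(weights, bands))
    mu2 = mu3 = mu4 = 0.0
    for p, (_, c, e) in zip(weights, bands):
        s2 = 1.0 / (2.0 * c)  # variance of a single band
        d = e - mean          # displacement of the band from the overall mean
        mu2 += p * (s2 + d * d)
        mu3 += p * (3.0 * d * s2 + d ** 3)
        mu4 += p * (3.0 * s2 * s2 + 6.0 * d * d * s2 + d ** 4)
    return mean, mu2, mu3 / mu2 ** 1.5, mu4 / mu2 ** 2


# Three sample points per sequence (hypothetical choices inside the studied ranges).
seq_I = [descriptors((5.0, 1.0, 1.2, 5.0 + dc, 1.0, 2.2)) for dc in (0.0, 5.0, 19.0)]
seq_II = [descriptors((5.0, 1.0, 1.2, 5.0, 1.0 + da, 2.2)) for da in (0.0, 5.0, 9.0)]
seq_III = [descriptors((5.0, 1.0, 1.2, 5.0, 1.0, 2.7 - de)) for de in (0.0, 0.5, 0.9)]

for name, seq in (("I", seq_I), ("II", seq_II), ("III", seq_III)):
    print(name, [tuple(round(v, 4) for v in d) for d in seq])
```

At the sampled points the first moment and the variance decrease along sequence I, the first moment increases along sequence II, and the skewness stays at zero along sequence III while the kurtosis grows.
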

Fig. 5 presents $D$ defined in Eqs. (16)-(20). In the case of sequence I, if $\delta c = 0$,
we compare two identical distributions and all the descriptors are equal to zero. In this case
$\mathcal{D}$ is the most sensitive to the changes of $\delta c$, in contrast to the other descriptors, which are
nearly constant. The two distributions are rather similar in the sense of the average value, of
the width, of the asymmetry and of the kurtosis (the values of $D_1$, $D_2$, $D_3$, $D_4$ are small and
the corresponding curves cross). In the case of sequence II, we observe small values of $D_2$ and
$D_1$, which indicates a large similarity of the two distributions in the sense of the width and of the
average values. For small values of $\delta a$ we observe crossings between $D_3$, $D_4$ and $\mathcal{D}$. The
most sensitive to the changes of $\delta a$ is $D_4$. In the case of sequence III, the behaviour of $D_1$ and $D_2$
is very similar. Both spectra are totally symmetric ($M_3''^{\alpha} = M_3''^{\beta} = 0$). Therefore $D_3 = 0$
for all $\delta\epsilon$. $D_4$ and $\mathcal{D}$ cross and change very substantially, in contrast to $D_1$ and $D_2$, which are
nearly constant.

Fig. 6 presents the similarity parameters $S_k$ for $k = 1, 2, 3, 4$ [Eqs. (22)-(25)]. Small values
of $S$ correspond to high similarity of the model spectra. In particular, if $\delta c = 0$ (sequence
I) then $S_k = 0$ for all $k$. As we can see, $S$ is the smallest for $k = 1$ and increases with
increasing $k$. Analogously to sequence I, $S_1 < S_2 < S_3 < S_4$ for all $\delta a$ (sequence II) and
for all $\delta\epsilon$ (sequence III). Intuitively, we expect that two systems which are similar to each
other when only one property is considered may exhibit more differences if we look at the
systems in more detail, taking into account more properties. These features can be seen in
Fig. 6.

FIG. 5: Parameters $D$ as functions of $\delta c$ (sequence I), $\delta a$ (sequence II), $\delta\epsilon$ (sequence III).

IV. CONCLUSIONS 

Statistical moments describe in an adequate way the degree of similarity of two-band
model spectra. Though the mathematical model describing the shapes of the spectra is relatively
simple, it reflects the behaviour of real molecular spectra. Three parameters, $c$, $a$ and
$\epsilon$, influence different aspects of the shapes of the spectra and the resulting values of $D$. In
particular, the parameters $D$ and the corresponding $S$ are the smallest if $a$ and $\epsilon$ are constant (sequence
I). In this case the spectra are only slightly modified by $\delta c$ (Fig. 1). Larger differences between the
spectra are caused by the parameter $\delta a$, while $c$ and $\epsilon$ are constant (sequence II). The influence
of $\epsilon$ on the spectra is also large (sequence III). The additional parameter $\mathcal{D}$ introduces some
independent information about the spectra. In contrast to the case of single-band model spectra
studied in our previous paper, where its behaviour is very similar to that of $D_4$, here it appears
to be the most sensitive index (sequence I).

FIG. 6: Parameters $S_k$ as functions of $\delta c$ (sequence I), $\delta a$ (sequence II), $\delta\epsilon$ (sequence III).



Summarizing, we demonstrated that spectral density distribution moments can be used
for defining similarity indices of spectra. Grouping molecules according to the spectral
density distribution moments offers a chance to discover new characteristics in the
field of molecular similarity and, in particular, may be a tool for studies in the area of
computational toxicology.